{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Twitter HOWTO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This document is an overview of how to use NLTK to collect and process Twitter data. It was written as an IPython notebook, and if you have IPython installed, you can download [the source of the notebook](https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/twitter.ipynb) from the NLTK GitHub repository and run the notebook in interactive mode.\n",
    "\n",
    "Most of the tasks that you might want to carry out with 'live' Twitter data require you to authenticate your request by registering for API keys. This is usually a once-only step. When you have registered your API keys, you can store them in a file on your computer, and then use them whenever you want. We explain what's involved in the section [First Steps](#first_steps).\n",
    "\n",
    "If you have already obtained Twitter API keys as part of some earlier project, [storing your keys](#store_keys) explains how to save them to a file that NLTK will be able to find. Alternatively, if you just want to play around with the Twitter data that is distributed as part of NLTK, head over to the section on using the [`twitter-samples` corpus reader](#corpus_reader).\n",
    "\n",
    "Once you have got authentication sorted out, we'll show you [how to use NLTK's `Twitter` class](#simple). This is made as simple as possible, but deliberately limits what you can do.  \n",
    "\n",
    "## <a name=\"first_steps\">First Steps</a>\n",
    "\n",
    "As mentioned above, in order to collect data from Twitter, you first need to register a new *application* &mdash; this is Twitter's way of referring to any computer program that interacts with the Twitter API. As long as you save your registration information correctly, you should only need to do this once, since the information should work for any NLTK code that you write. You will need to have a Twitter account before you can register. Twitter also insists that [you add a mobile phone number to your Twitter profile](https://support.twitter.com/articles/110250-adding-your-mobile-number-to-your-account-via-web) before you will be allowed to register an application.\n",
    "\n",
    "These are the steps you need to carry out.\n",
    "\n",
    "### <a name=\"api_keys\">Getting your API keys from Twitter</a>\n",
    "\n",
    "1. Sign in to your Twitter account at https://apps.twitter.com. You should then get sent to a screen that looks something like this:\n",
    "<img src=\"images/twitter_app1.tiff\" width=\"600px\">\n",
    "Clicking on the **Create New App** button should take you to the following screen:\n",
    "<img src=\"images/twitter_app2.tiff\" width=\"600px\">\n",
    "The information that you provide for **Name**, **Description** and **Website** can be anything you like.\n",
    "\n",
    "2. Make sure that you select **Read and Write** access for your application (as specified on the *Permissions* tab of Twitter's Application Management screen):\n",
    "<img src=\"images/twitter_app3.tiff\" width=\"600px\">\n",
    "\n",
    "3. Go to the tab labeled **Keys and Access Tokens**. It should look something like this, but with actual keys rather than a string of Xs:\n",
    "<img src=\"images/twitter_app4.png\" width=\"650px\">\n",
    "As you can see, this will give you four distinct keys: consumer key, consumer key secret, access token and access token secret.\n",
    "\n",
    "### <a name=\"store_keys\">Storing your keys</a>\n",
    "\n",
    "1. Create a folder named `twitter-files` in your home directory. Within this folder, use a text editor to create a new file called `credentials.txt`. Make sure that this file is just a plain text file. In it, you should create which you should store in a text file with the following structure:\n",
    "```\n",
    "app_key=YOUR CONSUMER KEY  \n",
    "app_secret=YOUR CONSUMER SECRET  \n",
    "oauth_token=YOUR ACCESS TOKEN  \n",
    "oauth_token_secret=YOUR ACCESS TOKEN SECRET\n",
    "```\n",
    "Type the part up to and includinge the '=' symbol exactly as shown. The values on the right-hand side of the '=' &mdash; that is, everything in caps &mdash; should be cut-and-pasted from the relevant API key information shown on the Twitter **Keys and Access Tokens**. Save the file and that's it.\n",
    "\n",
    "2. It's going to be important for NLTK programs to know where you have stored your\n",
    "   credentials. We'll assume that this folder is called `twitter-files`, but you can call it      anything you like. We will also assume that this folder is where you save any files            containing tweets that you collect. Once you have decided on the name and location of this \n",
    "   folder, you will need to set the `TWITTER` environment variable to this value. \n",
    "\n",
    "   On a Unix-like system (including MacOS), you will set the variable something like this:\n",
    "   ```bash\n",
    "   export TWITTER=\"/path/to/your/twitter-files\"\n",
    "   ```\n",
    "   Rather than having to give this command each time you start a new session, it's advisable      to add it to your shell's configuration file, e.g. to `.bashrc`.\n",
    "\n",
    "   On a Windows machine, right click on “My Computer” then select `Properties > Advanced >        Environment Variables > User Variables > New...` \n",
    "\n",
    "   One important thing to remember is that you need to keep your `credentials.txt` file          private. So do **not** share your `twitter-files` folder with anyone else, and do **not**      upload it to a public repository such as GitHub.\n",
    "\n",
    "3. Finally, read through Twitter's [Developer Rules of the Road](https://dev.twitter.com/overview/terms/policy). As far as these rules are concerned, you count as both the application developer and the user."
   ]
  },
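  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the file format above concrete, here is a minimal sketch of how a `credentials.txt` file could be located through the `TWITTER` environment variable and parsed into a dictionary. The helper name `read_credentials` is hypothetical and this is not NLTK's own implementation; in practice, NLTK's `credsfromfile()` function, introduced later in this document, does the job for you.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def read_credentials(folder=None, filename='credentials.txt'):\n",
    "    # Hypothetical helper for illustration only; not part of NLTK.\n",
    "    # Fall back to the TWITTER environment variable if no folder is given.\n",
    "    folder = folder or os.environ['TWITTER']\n",
    "    creds = {}\n",
    "    with open(os.path.join(folder, filename), encoding='utf-8') as infile:\n",
    "        for line in infile:\n",
    "            line = line.strip()\n",
    "            if not line:\n",
    "                continue  # skip blank lines\n",
    "            # Split on the first '=' only, in case a value contains '='.\n",
    "            key, _, value = line.partition('=')\n",
    "            creds[key.strip()] = value.strip()\n",
    "    return creds\n",
    "```\n",
    "\n",
    "Calling `read_credentials()` with no arguments assumes that `TWITTER` has been set as described above; a `KeyError` at that point usually means that the variable is missing from the current session."
   ]
  },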
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a name=\"twython\">Install Twython</a>\n",
    "\n",
    "The NLTK Twitter package relies on a third party library called [Twython](https://twython.readthedocs.org/). Install Twython via [pip](https://pip.pypa.io):\n",
    "```bash\n",
    "$ pip install twython\n",
    "```\n",
    "\n",
    "or with [easy_install](https://pythonhosted.org/setuptools/easy_install.html):\n",
    "\n",
    "```bash\n",
    "$ easy_install twython\n",
    "```\n",
    "We're now ready to get started. The next section will describe how to use the `Twitter` class to talk to the Twitter API."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*More detail*:\n",
    "Twitter offers are two main authentication options. OAuth 1 is for user-authenticated API calls, and allows sending status updates, direct messages, etc, whereas OAuth 2 is for application-authenticated calls, where read-only access is sufficient. Although OAuth 2 sounds more appropriate for the kind of tasks envisaged within NLTK, it turns out that access to Twitter's Streaming API requires OAuth 1, which is why it's necessary to obtain *Read and Write* access for your application."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a name=\"simple\">Using the simple `Twitter` class</a>\n",
    "\n",
    "### Dipping into the Public Stream\n",
    "\n",
    "The `Twitter` class is intended as a simple means of interacting with the Twitter data stream. Later on, we'll look at other methods which give more fine-grained control. \n",
    "\n",
    "The Twitter live public stream is a sample (approximately 1%) of all Tweets that are currently being published by users. They can be on any topic and in any language. In your request, you can give keywords which will narrow down the Tweets that get delivered to you. Our first example looks for Tweets which include either the word *love* or *hate*. We limit the call to finding 10 tweets. When you run this code, it will definitely produce different results from those shown below!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sana magkakaisa na ang mga Kapamilya at Kapuso.  Spread love, not hate\n",
      " #ShowtimeKapamiIyaDay #ALDubEBforLOVE\n",
      "@Real_Liam_Payne Please follow me , you mean the world to me and words can't describe how much i love you x3186\n",
      "Love my ugly wife\n",
      "RT @ansaberano: We Found Love\n",
      "#PushAwardsLizQuen\n",
      "RT @yungunmei: people want to fall in love but don't understand the concept\n",
      "I don't care, I love It  #EMABiggestFans1D\n",
      "RT @bryan_white: I'm not in the Philippines Yet but we are making a very BIG announcement in 2 days! Get ready! Love you! #GGMY #ALDubEBfor…\n",
      "I whole heartedly HATE @lakiamichelle like really HATE her 😩 who wants to be her friend because I DONT\n",
      "RT @lahrose23: I love yu to  https://t.co/dfsRwSp1IC\n",
      "RT @alone_in_woods: ahoj, já jsem tvůj pes a tohle je náš love song /// Zrní - Já jsem tvůj pes https://t.co/7L0XPHeA2d via @YouTube\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "from nltk.twitter import Twitter\n",
    "tw = Twitter()\n",
    "tw.tweets(keywords='love, hate', limit=10) #sample from the public stream"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next example filters the live public stream by looking for specific user accounts. In this case, we 'follow' two news organisations, namely `@CNN` and `@BBCNews`. [As advised by Twitter](https://dev.twitter.com/streaming/reference/post/statuses/filter), we use *numeric userIDs* for these accounts. If you run this code yourself, you'll see that Tweets are arriving much more slowly than in the previous example. This is because even big new organisations don't publish Tweets that often.\n",
    "\n",
    "A bit later we will show you how to use Python to convert usernames such as `@CNN` to userIDs such as `759251`, but for now you might find it simpler to use a web service like [TweeterID](http://tweeterid.com) if you want to experiment with following different accounts than the ones shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Judge grants petition allowing @Caitlyn_Jenner to officially change her name and gender. http://t.co/HpCbAQ64Mk http://t.co/BPaKy2…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "tw = Twitter()\n",
    "tw.tweets(follow=['759251', '612473'], limit=10) # see what CNN and BBC are talking about"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving Tweets to a File\n",
    "\n",
    "By default, the `Twitter` class will just print out Tweets to your computer terminal. Although it's fun to view the Twitter stream zipping by on your screen, you'll probably want to save some tweets in a file. We can tell the `tweets()` method to save to a file by setting the flag `to_screen` to `False`. \n",
    "\n",
    "The `Twitter` class will look at the value of your environmental variable `TWITTER` to determine which folder to use to save the tweets, and it will put them in a date-stamped file with the prefix `tweets`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing to /Users/ewan/twitter-files/tweets.20150926-154251.json\n",
      "Written 25 Tweets\n"
     ]
    }
   ],
   "source": [
    "tw = Twitter()\n",
    "tw.tweets(to_screen=False, limit=25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we've been taking data from the live public stream. However, it's also possible to retrieve past tweets, for example by searching for specific keywords, and setting `stream=False`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video] http://t.co/eY4GgKS3ak\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video] http://t.co/Pflf7A6Tr6\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video] http://t.co/mibYfNISBT http://t.co/9ElX70F4St\n",
      "Photo: “Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: Hillary... http://t.co/qIiWGk1jbM\n",
      "lena dunham and hilary clinton talking about feminism... l o l theyre the two most hypocritical and clueless about what feminism actually is\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: \n",
      "Hillary Clinton An... http://t.co/31shf6VeEu\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: \n",
      "Hillary Clinton An... http://t.co/uvft4LDS0t\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: \n",
      "Hillary Clinton An... http://t.co/uEbc25V3E3\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: \n",
      "Hillary Cl... http://t.co/RNgziN9eWA #bossip\n",
      "“Girls” Creator Lena Dunham Interviews Hilary Clinton About… Lenny Kravitz’s Junk [Video]: \n",
      "Hillary Clinton An... http://t.co/gkB5aLEJJP\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "tw.tweets(keywords='hilary clinton', stream=False, limit=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a name=\"onwards\">Onwards and Upwards</a>\n",
    "\n",
    "In this section, we'll look at how to get more fine-grained control over processing Tweets. To start off, we will import a bunch of stuff from the `twitter` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following example, you'll see the line\n",
    "``` python\n",
    "oauth = credsfromfile()\n",
    "```\n",
    "This gets hold of your stored API key information. The function `credsfromfile()` by default looks for a file called `credentials.txt` in the directory set by the environment variable `TWITTER`, reads the contents and returns the result as a dictionary. We then pass this dictionary as an argument when initializing our client code. We'll be using two classes to wrap the clients: `Streamer` and `Query`; the first of these calls [the Streaming API](https://dev.twitter.com/streaming/overview) and the second calls Twitter's [Search API](https://dev.twitter.com/rest/public) (also called the REST API). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*More detail*: For more detail, see this blog post on [The difference between the Twitter Firehose API, the Twitter Search API, and the Twitter Streaming API](http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After initializing a client, we call the `register()` method to specify whether we want to view the data on a terminal or write it to a file. Finally, we call a method which determines the API endpoint to address; in this case, we use `sample()` to get a random sample from the the Streaming API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RT @EPVLatino: ¿Y entonces? El gobierno sigue importando carros mientras las plantas Chery los tiene acumulados http://t.co/bBhrawqHe7\n",
      "RT @AbrahamMateoMus: . @menudanochetv aquí se suman nuestros Abrahamers MEXICAN@S!! 👏 \n",
      "#MenudaNocheConAM 😉 http://t.co/8DMw31wZ5i\n",
      "RT @Joeyclipstar: ** FRESH ** Bow Wow Signs to Bad Boy Records - The Breakfast Club http://t.co/3w58p6Sbx2 RT http://t.co/LbQU2brfpf\n",
      "#شاركونا\n",
      "اشي مستحيل تكمل يومكك بدونه ... ؟ 🌚\n",
      "#Manal\n",
      "RT @techjunkiejh: MEAN Stack Tutorial #mongodb #ExpressJS #angularjs #nodejs #javascript http://t.co/4gTFsj2dtP http://t.co/a86hmb4mRx\n",
      "Only @MariamDiamond would reply to a spider on twitter 😂😂\n",
      "RT @CJLeBlanc: @SeanCarrigan greets the new day..full spirit, verve and no small amount of vodka! GO TEAM #YR! … http://t.co/bQIglZVDxR\n",
      "んぐぅおぉ、はらみーライブ楽しかったようで何より。行きたかったンゴ〜\n",
      "RT @NicoleRaine8: @maine_richards @MaLuisaMiranda1 count me in ngkakape nyahaha #ALDubEBforLOVE\n",
      "RT @RadioDelPlata: [AHORA] \"Me amputaron los 4 miembros\" Perla Pascarelli sobre #Malapraxis a #MónicayCésar http://t.co/StUhpxDeM3\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "oauth = credsfromfile()\n",
    "client = Streamer(**oauth)\n",
    "client.register(TweetViewer(limit=10))\n",
    "client.sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next example is similar, except that we call the `filter()` method with the `track` parameter followed by a string literal. The string is interpreted as a list of search terms where [comma indicates a logical OR](https://dev.twitter.com/streaming/overview/request-parameters#track). The terms are treated as case-insensitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "European countries at heart of refugee crisis seek to ease tensions: Hungary announces removal of razor wire f... http://t.co/PavCKddtY2\n",
      "RT @onlinewweman: Germany told students to wear \"modest clothing\" bc they don't want the refugees to have \"misunderstandings.\" That's a wei…\n",
      "RT @El_consciente: El cuento ha cambiado. A pinocho le crecía la nariz si mentía. A los políticos europeos sus fortunas. Made in Germany ht…\n",
      "VIDEO=&gt; Finns Attack “Refugee” Bus with Rocks and Fireworks – Refugees Turn Back to Sweden https://t.co/94KqhyCNjJ http://t.co/e3kmeGjRFn\n",
      "RT @El_consciente: Merkel al volante de Europa. Fabricación en cadena de productos fraudulentos. Made in Germany http://t.co/SJ5BYQ7lIu htt…\n",
      "European countries at heart of refugee crisis seek to ease tensions: Hungary announces rem... http://t.co/5BmOYNK3Kj (via @EricBarbosa11\n",
      "@SirCorgis @matty_is @RT_com but will Poland blame the ppl actually causing the refugee crisis? Cause and effect is a bitch innit?\n",
      "RT @El_consciente: Merkel al volante de Europa. Fabricación en cadena de productos fraudulentos. Made in Germany http://t.co/SJ5BYQ7lIu htt…\n",
      "♥ https://t.co/CyoWdON0li\n",
      "RT @mjesusgz: Castle Germany http://t.co/scs5dJE1Gk\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "client = Streamer(**oauth)\n",
    "client.register(TweetViewer(limit=10))\n",
    "client.filter(track='refugee, germany')"
   ]
  },
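  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough approximation of how such a `track` string selects Tweets, the following sketch treats commas as a logical OR and ignores case. It is only an illustration and not Twitter's actual matching algorithm; the helper name `matches_track` is hypothetical, and the handling of multi-word phrases (every word must be present) is an assumption.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "\n",
    "def matches_track(text, track):\n",
    "    # Hypothetical helper; only approximates the matching described above.\n",
    "    # Lower-case the text and split it into a set of word tokens.\n",
    "    words = set(re.findall(r'\\w+', text.lower()))\n",
    "    for phrase in track.split(','):\n",
    "        terms = phrase.lower().split()\n",
    "        # Assumption: every word of a multi-word phrase must be present.\n",
    "        if terms and all(term in words for term in terms):\n",
    "            return True  # the comma acts as OR: one matching phrase is enough\n",
    "    return False\n",
    "\n",
    "\n",
    "print(matches_track('Castle Germany', 'refugee, germany'))\n",
    "```\n",
    "\n",
    "Because the sketch compares whole word tokens, a Tweet containing only *refugees* would not match the term *refugee* here."
   ]
  },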
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whereas the Streaming API lets us access near real-time Twitter data, the Search API lets us query for past Tweets. In the following example, the value `tweets` returned by `search_tweets()` is a generator; the expression `next(tweets)` gives us the first Tweet from the generator. \n",
    "\n",
    "Although Twitter delivers Tweets as [JSON](http://www.json.org) objects, the Python client encodes them as dictionaries, and the example pretty-prints a portion of the dictionary corresponding the Tweet in question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'contributors': None,\n",
      " 'coordinates': None,\n",
      " 'created_at': 'Sat Sep 26 14:25:12 +0000 2015',\n",
      " 'entities': {...},\n",
      " 'favorite_count': 0,\n",
      " 'favorited': False,\n",
      " 'geo': None,\n",
      " 'id': 647778955005665280,\n",
      " 'id_str': '647778955005665280',\n",
      " 'in_reply_to_screen_name': None,\n",
      " 'in_reply_to_status_id': None,\n",
      " 'in_reply_to_status_id_str': None,\n",
      " 'in_reply_to_user_id': None,\n",
      " 'in_reply_to_user_id_str': None,\n",
      " 'is_quote_status': False,\n",
      " 'lang': 'en',\n",
      " 'metadata': {...},\n",
      " 'place': None,\n",
      " 'possibly_sensitive': False,\n",
      " 'retweet_count': 0,\n",
      " 'retweeted': False,\n",
      " 'source': '<a href=\"http://www.techwars.io\" rel=\"nofollow\">TechWars</a>',\n",
      " 'text': 'We compared #gate vs #nltk - see results: http://t.co/jvQ4Ph85L1',\n",
      " 'truncated': False,\n",
      " 'user': {...}}\n"
     ]
    }
   ],
   "source": [
    "client = Query(**oauth)\n",
    "tweets = client.search_tweets(keywords='nltk', limit=10)\n",
    "tweet = next(tweets)\n",
    "from pprint import pprint\n",
    "pprint(tweet, depth=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Twitter's own documentation [provides a useful overview of all the fields in the JSON object](https://dev.twitter.com/overview/api/tweets) and it may be helpful to look at this [visual map of a Tweet object](http://www.scribd.com/doc/30146338/map-of-a-tweet).\n",
    "\n",
    "Since each Tweet is converted into a Python dictionary, it's straightforward to just show a selected field, such as the value of the `'text'` key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Slammer an immigration lawyer seattle wa protection if purusha this morning polaric deportation?: Nltk\n",
      "Python Text Processing with NLTK 2.0 Cookbook / Jacob Perkins\n",
      "http://t.co/0gUjlTWA7G\n",
      "\n",
      "49\n",
      "RT @tjowens: DHbox http://t.co/skIzU3Nm6C \"Ready-to-go configurations of Omeka, NLTK, IPython, R Studio, and Mallet\" #odh2015 http://t.co/6…\n",
      "RT @tjowens: DHbox http://t.co/skIzU3Nm6C \"Ready-to-go configurations of Omeka, NLTK, IPython, R Studio, and Mallet\" #odh2015 http://t.co/6…\n",
      "RT @tjowens: DHbox http://t.co/skIzU3Nm6C \"Ready-to-go configurations of Omeka, NLTK, IPython, R Studio, and Mallet\" #odh2015 http://t.co/6…\n",
      "RT @tjowens: DHbox http://t.co/skIzU3Nm6C \"Ready-to-go configurations of Omeka, NLTK, IPython, R Studio, and Mallet\" #odh2015 http://t.co/6…\n",
      "RT @ideaofhappiness: Interesting! @DH_Box is a Docker container for digital humanities computational work, pre-equipped with IPython, RStud…\n",
      "RT @ideaofhappiness: Interesting! @DH_Box is a Docker container for digital humanities computational work, pre-equipped with IPython, RStud…\n",
      "RT @dimazest: Stanford dependency parser support is merged into @NLTK_org https://t.co/aN6b1lFGPf\n"
     ]
    }
   ],
   "source": [
    "for tweet in tweets:\n",
    "    print(tweet['text'])"
   ]
  },
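  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since a Tweet is a plain dictionary, picking out several fields at once takes only a few lines. The sketch below uses a hypothetical helper, `select_fields`, and a cut-down example dictionary whose values are copied from the pretty-printed Tweet shown earlier; neither is part of NLTK.\n",
    "\n",
    "```python\n",
    "def select_fields(tweet, fields=('id_str', 'created_at', 'lang', 'text')):\n",
    "    # Hypothetical helper: keep only the requested top-level keys.\n",
    "    # dict.get() returns None for any key missing from the Tweet.\n",
    "    return {field: tweet.get(field) for field in fields}\n",
    "\n",
    "\n",
    "# A cut-down Tweet dictionary, standing in for one returned by the client.\n",
    "example = {\n",
    "    'created_at': 'Sat Sep 26 14:25:12 +0000 2015',\n",
    "    'id_str': '647778955005665280',\n",
    "    'lang': 'en',\n",
    "    'retweet_count': 0,\n",
    "    'text': 'We compared #gate vs #nltk - see results: http://t.co/jvQ4Ph85L1',\n",
    "}\n",
    "print(select_fields(example, fields=('lang', 'text')))\n",
    "```\n",
    "\n",
    "Nested values such as `'user'` or `'entities'` are themselves dictionaries, so their contents need a second lookup, for example `tweet['user']['screen_name']`."
   ]
  },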
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing to /Users/ewan/twitter-files/tweets.20150926-154337.json\n"
     ]
    }
   ],
   "source": [
    "client = Query(**oauth)\n",
    "client.register(TweetWriter())\n",
    "client.user_tweets('timoreilly', 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a list of user IDs, the following example shows how to retrieve the screen name and other information about the users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CNN, followers: 19806095, following: 1102\n",
      "BBCNews, followers: 4935491, following: 105\n",
      "ReutersLive, followers: 307337, following: 55\n",
      "BreakingNews, followers: 7949242, following: 541\n",
      "AJELive, followers: 1117, following: 19\n"
     ]
    }
   ],
   "source": [
    "userids = ['759251', '612473', '15108702', '6017542', '2673523800']\n",
    "client = Query(**oauth)\n",
    "user_info = client.user_info_from_id(userids)\n",
    "for info in user_info:\n",
    "    name = info['screen_name']\n",
    "    followers = info['followers_count']\n",
    "    following = info['friends_count']\n",
    "    print(\"{}, followers: {}, following: {}\".format(name, followers, following))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list of user IDs can also be used as input to the Streaming API client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @bbcweather: Cameras at the ready for #supermoon #eclipse on 27/28th Sept, next one won't be until 2033! http://t.co/SPucnmBqaD http://t…\n",
      "RT @BreakingNews: Alleged Libya-Europe people smuggler killed in shootout, Libya officials say Italy behind asssassination - @guardian http…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "@CNN white water, Monica, emails, Benghazi. A family/foundation of lies and crime. Indict Hillary for breaking laws\n",
      "RT @CNN: Bill Clinton on email scrutiny: 'I've never seen so much expended on so little.'\n",
      "http://t.co/XkLP0IHeOG\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @CNN: Sunday's #supermoon #eclipse has some excited, but for others, it's an ominous \"blood moon.\" http://t.co/2B1wdQru0q http://t.co/Aw…\n",
      "RT @BreakingNews: Alleged Libya-Europe people smuggler killed in shootout, Libya officials say Italy behind asssassination - @guardian http…\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "client = Streamer(**oauth)\n",
    "client.register(TweetViewer(limit=10))\n",
    "client.statuses.filter(follow=userids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To store data that Twitter sents by the Streaming API, we register a `TweetWriter` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing to /Users/ewan/twitter-files/tweets.20150926-154408.json\n",
      "Written 10 Tweets\n"
     ]
    }
   ],
   "source": [
    "client = Streamer(**oauth)\n",
    "client.register(TweetWriter(limit=10))\n",
    "client.statuses.sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the full signature of the `Tweetwriter`'s `__init__()` method:\n",
    "```python\n",
    "def __init__(self, limit=2000, upper_date_limit=None, lower_date_limit=None,\n",
    "                 fprefix='tweets', subdir='twitter-files', repeat=False,\n",
    "                 gzip_compress=False):                \n",
    "```\n",
    "If the `repeat` parameter is set to `True`, then the writer will write up to the value of `limit` in file `file1`, then open a new file `file2` and write to it until the limit is reached, and so on indefinitely. The parameter `gzip_compress` can be used to compress the files once they have been written."
   ]
  },
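  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of `repeat=True` can be pictured as splitting the incoming Tweets into consecutive batches of at most `limit` items, one batch per output file. The generator below is a hypothetical illustration of that idea only; it is not how NLTK implements the writer, and the name `split_into_batches` is made up for this sketch.\n",
    "\n",
    "```python\n",
    "def split_into_batches(items, limit):\n",
    "    # Hypothetical illustration of the repeat behaviour: group items into\n",
    "    # consecutive batches of at most `limit`, one batch per output file.\n",
    "    batch = []\n",
    "    for item in items:\n",
    "        batch.append(item)\n",
    "        if len(batch) == limit:\n",
    "            yield batch\n",
    "            batch = []\n",
    "    if batch:\n",
    "        yield batch  # final, partially filled batch\n",
    "\n",
    "\n",
    "# Five stand-in items with a limit of two per file.\n",
    "for number, batch in enumerate(split_into_batches(range(5), limit=2), start=1):\n",
    "    print('file{}: {} Tweets'.format(number, len(batch)))\n",
    "```"
   ]
  },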
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <a name=\"corpus_reader\">Using a Tweet Corpus</a>\n",
    "\n",
    "NLTK's Twitter corpus currently contains a sample of 20k Tweets (named '`twitter_samples`')\n",
    "retrieved from the Twitter Streaming API, together with another 10k which are divided according to sentiment into negative and positive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.corpus import twitter_samples\n",
    "twitter_samples.fileids()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We follow standard practice in storing full Tweets as line-separated\n",
    "JSON. These data structures can be accessed via `tweets.docs()`. However, in general it\n",
    "is more practical to focus just on the text field of the Tweets, which\n",
    "are accessed via the `strings()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP\n",
      "VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY\n",
      "RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…\n",
      "RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1\n",
      "RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…\n",
      "RT @Nigel_Farage: Make sure you tune in to #AskNigelFarage tonight on BBC 1 at 22:50! #UKIP http://t.co/ogHSc2Rsr2\n",
      "RT @joannetallis: Ed Milliband is an embarrassment. Would you want him representing the UK?!  #bbcqt vote @Conservatives\n",
      "RT @abstex: The FT is backing the Tories. On an unrelated note, here's a photo of FT leader writer Jonathan Ford (next to Boris) http://t.c…\n",
      "RT @NivenJ1: “@George_Osborne: Ed Miliband proved tonight why he's not up to the job” Tbf you've spent 5 years doing that you salivating do…\n",
      "LOLZ to Trickle Down Wealth. It's never trickling past their own wallets. Greed always wins $$$ for the greedy.  https://t.co/X7deoPbS97\n",
      "SNP leader faces audience questions http://t.co/TYClKltSpW\n",
      "RT @cononeilluk: Cameron \"Ed Milliband hanging out with Russell Brand. He is a joke. This is an election. This is about real people' http:/…\n",
      "RT @politicshome: Ed Miliband: Last Labour government did not overspend http://t.co/W9RJ2aSH6o http://t.co/4myFekg5ex\n",
      "If Miliband is refusing to do any deal with the SNP, how does he plan on forming a government?\n",
      "RT @scotnotbritt: Well thats it. LABOUR would rather have a TORY government rather than work with the SNP. http://t.co/SNMkRDCe9f\n"
     ]
    }
   ],
   "source": [
    "strings = twitter_samples.strings('tweets.20150430-223406.json')\n",
    "for string in strings[:15]:\n",
    "    print(string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default tokenizer for Tweets (`casual.py`) is specialised for 'casual' text, and\n",
    "the `tokenized()` method returns a list of lists of tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['RT', '@KirkKus', ':', 'Indirect', 'cost', 'of', 'the', 'UK', 'being', 'in', 'the', 'EU', 'is', 'estimated', 'to', 'be', 'costing', 'Britain', '£', '170', 'billion', 'per', 'year', '!', '#BetterOffOut', '#UKIP']\n",
      "['VIDEO', ':', 'Sturgeon', 'on', 'post-election', 'deals', 'http://t.co/BTJwrpbmOY']\n",
      "['RT', '@LabourEoin', ':', 'The', 'economy', 'was', 'growing', '3', 'times', 'faster', 'on', 'the', 'day', 'David', 'Cameron', 'became', 'Prime', 'Minister', 'than', 'it', 'is', 'today', '..', '#BBCqt', 'http://t.co…']\n",
      "['RT', '@GregLauder', ':', 'the', 'UKIP', 'east', 'lothian', 'candidate', 'looks', 'about', '16', 'and', 'still', 'has', 'an', 'msn', 'addy', 'http://t.co/7eIU0c5Fm1']\n",
      "['RT', '@thesundaypeople', ':', \"UKIP's\", 'housing', 'spokesman', 'rakes', 'in', '£', '800k', 'in', 'housing', 'benefit', 'from', 'migrants', '.', 'http://t.co/GVwb9Rcb4w', 'http://t.co/c1AZxcLh…']\n"
     ]
    }
   ],
   "source": [
    "tokenized = twitter_samples.tokenized('tweets.20150430-223406.json')\n",
    "for toks in tokenized[:5]:\n",
    "    print(toks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Extracting Parts of a Tweet\n",
    "\n",
    "If we want to carry out other kinds of analysis on Tweets, we have to work directly with the file rather than via the corpus reader. For demonstration purposes, we will use the same file as the one in the preceding section, namely  `tweets.20150430-223406.json`. The `abspath()` method of the corpus gives us the full pathname of the relevant file. If your NLTK data is installed in the default location on a Unix-like system, this pathname will be `'/usr/share/nltk_data/corpora/twitter_samples/tweets.20150430-223406.json'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import twitter_samples\n",
    "input_file = twitter_samples.abspath(\"tweets.20150430-223406.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The function `json2csv()` takes as input a file-like object consisting of Tweets as line-delimited JSON objects and returns a file in CSV format. The third parameter of the function lists the fields that we want to extract from the JSON. One of the simplest examples is to extract just the text of the Tweets (though of course it would have been even simpler to use the `strings()` method of the corpus reader)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.twitter.common import json2csv\n",
    "with open(input_file) as fp:\n",
    "    json2csv(fp, 'tweets_text.csv', ['text'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've passed the filename `'tweets_text.csv'` as the second argument of `json2csv()`. Unless you provide a complete pathname, the file will be created in the directory where you are currently executing Python.\n",
    "\n",
    "If you open the file `'tweets_text.csv'`, the first 5 lines should look as follows:\n",
    "\n",
    "```\n",
    "RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP\n",
    "VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY\n",
    "RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…\n",
    "RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1\n",
    "RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, in some applications you may want to work with Tweet metadata, e.g., the creation date and the user. As mentioned earlier, all the fields of a Tweet object are described in [the official Twitter API](https://dev.twitter.com/overview/api/tweets). \n",
    "\n",
    "The third argument of `json2csv()` can specified so that the function selects relevant parts of the metadata. For example, the following will generate a CSV file including most of the metadata together with the id of the user who has published it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open(input_file) as fp:\n",
    "    json2csv(fp, 'tweets.20150430-223406.tweet.csv',\n",
    "            ['created_at', 'favorite_count', 'id', 'in_reply_to_status_id', \n",
    "            'in_reply_to_user_id', 'retweet_count', 'retweeted', \n",
    "            'text', 'truncated', 'user.id'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "created_at,favorite_count,id,in_reply_to_status_id,in_reply_to_user_id,retweet_count,retweeted,text,truncated,user.id\n",
      "\n",
      "Thu Apr 30 21:34:06 +0000 2015,0,593891099434983425,,,0,False,RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP,False,107794703\n",
      "\n",
      "Thu Apr 30 21:34:06 +0000 2015,0,593891099548094465,,,0,False,VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY,False,557422508\n",
      "\n",
      "Thu Apr 30 21:34:06 +0000 2015,0,593891099388846080,,,0,False,RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…,False,3006692193\n",
      "\n",
      "Thu Apr 30 21:34:06 +0000 2015,0,593891100429045760,,,0,False,RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1,False,455154030\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for line in open('tweets.20150430-223406.tweet.csv').readlines()[:5]:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first nine elements of the list are attributes of the Tweet, while the last one, `user.id`, takes the user object associated with the Tweet, and retrieves the attributes in the list (in this case only the id). The object for the Twitter user is described in the  [Twitter API for users](https://dev.twitter.com/overview/api/users)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rest of the metadata of the Tweet are the so-called [entities](https://dev.twitter.com/overview/api/entities) and [places](https://dev.twitter.com/overview/api/places). The following examples show how to get each of those entities. They all include the id of the Tweet as the first argument, and some of them include also the text for clarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.twitter.common import json2csv_entities\n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.hashtags.csv',\n",
    "                        ['id', 'text'], 'hashtags', ['text'])\n",
    "    \n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.user_mentions.csv',\n",
    "                        ['id', 'text'], 'user_mentions', ['id', 'screen_name'])\n",
    "    \n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.media.csv',\n",
    "                        ['id'], 'media', ['media_url', 'url'])\n",
    "    \n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.urls.csv',\n",
    "                        ['id'], 'urls', ['url', 'expanded_url'])\n",
    "    \n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.place.csv',\n",
    "                        ['id', 'text'], 'place', ['name', 'country'])\n",
    "\n",
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.place_bounding_box.csv',\n",
    "                        ['id', 'name'], 'place.bounding_box', ['coordinates'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, when a Tweet is actually a retweet, the original tweet can be also fetched from the same file, as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open(input_file) as fp:\n",
    "    json2csv_entities(fp, 'tweets.20150430-223406.original_tweets.csv',\n",
    "                        ['id'], 'retweeted_status', ['created_at', 'favorite_count', \n",
    "                        'id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweet_count',\n",
    "                        'text', 'truncated', 'user.id'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the first id corresponds to the retweeted Tweet, and the second id to the original Tweet.\n",
    "\n",
    "### Using Dataframes\n",
    "\n",
    "Sometimes it's convenient to manipulate CSV files as tabular data, and this is made easy with the [Pandas](http://pandas.pydata.org/) data analysis library. `pandas` is not currrently one of the dependencies of NLTK, and you will probably have to install it specially.\n",
    "\n",
    "Here is an example of how to read a CSV file into a `pandas` dataframe. We use the `head()` method of a dataframe to just show the first 5 rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>favorite_count</th>\n",
       "      <th>in_reply_to_status_id</th>\n",
       "      <th>in_reply_to_user_id</th>\n",
       "      <th>retweet_count</th>\n",
       "      <th>retweeted</th>\n",
       "      <th>text</th>\n",
       "      <th>truncated</th>\n",
       "      <th>user.id</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>593891099434983425</th>\n",
       "      <td>Thu Apr 30 21:34:06 +0000 2015</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>RT @KirkKus: Indirect cost of the UK being in ...</td>\n",
       "      <td>False</td>\n",
       "      <td>107794703</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593891099548094465</th>\n",
       "      <td>Thu Apr 30 21:34:06 +0000 2015</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>VIDEO: Sturgeon on post-election deals http://...</td>\n",
       "      <td>False</td>\n",
       "      <td>557422508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593891099388846080</th>\n",
       "      <td>Thu Apr 30 21:34:06 +0000 2015</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>RT @LabourEoin: The economy was growing 3 time...</td>\n",
       "      <td>False</td>\n",
       "      <td>3006692193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593891100429045760</th>\n",
       "      <td>Thu Apr 30 21:34:06 +0000 2015</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>RT @GregLauder: the UKIP east lothian candidat...</td>\n",
       "      <td>False</td>\n",
       "      <td>455154030</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593891100768784384</th>\n",
       "      <td>Thu Apr 30 21:34:07 +0000 2015</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>RT @thesundaypeople: UKIP's housing spokesman ...</td>\n",
       "      <td>False</td>\n",
       "      <td>187547338</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        created_at  favorite_count  \\\n",
       "id                                                                   \n",
       "593891099434983425  Thu Apr 30 21:34:06 +0000 2015               0   \n",
       "593891099548094465  Thu Apr 30 21:34:06 +0000 2015               0   \n",
       "593891099388846080  Thu Apr 30 21:34:06 +0000 2015               0   \n",
       "593891100429045760  Thu Apr 30 21:34:06 +0000 2015               0   \n",
       "593891100768784384  Thu Apr 30 21:34:07 +0000 2015               0   \n",
       "\n",
       "                    in_reply_to_status_id  in_reply_to_user_id  retweet_count  \\\n",
       "id                                                                              \n",
       "593891099434983425                    NaN                  NaN              0   \n",
       "593891099548094465                    NaN                  NaN              0   \n",
       "593891099388846080                    NaN                  NaN              0   \n",
       "593891100429045760                    NaN                  NaN              0   \n",
       "593891100768784384                    NaN                  NaN              0   \n",
       "\n",
       "                   retweeted  \\\n",
       "id                             \n",
       "593891099434983425     False   \n",
       "593891099548094465     False   \n",
       "593891099388846080     False   \n",
       "593891100429045760     False   \n",
       "593891100768784384     False   \n",
       "\n",
       "                                                                 text  \\\n",
       "id                                                                      \n",
       "593891099434983425  RT @KirkKus: Indirect cost of the UK being in ...   \n",
       "593891099548094465  VIDEO: Sturgeon on post-election deals http://...   \n",
       "593891099388846080  RT @LabourEoin: The economy was growing 3 time...   \n",
       "593891100429045760  RT @GregLauder: the UKIP east lothian candidat...   \n",
       "593891100768784384  RT @thesundaypeople: UKIP's housing spokesman ...   \n",
       "\n",
       "                   truncated     user.id  \n",
       "id                                        \n",
       "593891099434983425     False   107794703  \n",
       "593891099548094465     False   557422508  \n",
       "593891099388846080     False  3006692193  \n",
       "593891100429045760     False   455154030  \n",
       "593891100768784384     False   187547338  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding=\"utf8\")\n",
    "tweets.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the dataframe it is easy, for example, to first select Tweets with a specific user ID and then retrieve their `'text'` value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id\n",
       "593891099548094465    VIDEO: Sturgeon on post-election deals http://...\n",
       "593891101766918144    SNP leader faces audience questions http://t.c...\n",
       "Name: text, dtype: object"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tweets.loc[tweets['user.id'] == 557422508]['text']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expanding a list of Tweet IDs\n",
    "\n",
    "Because the Twitter Terms of Service place severe restrictions on the distribution of Tweets by third parties, a workaround is to instead distribute just the Tweet IDs, which are not subject to the same restrictions. The method `expand_tweetids()` sends a request to the Twitter API to return the full Tweet (in Twitter's terminology, a *hydrated* Tweet) that corresponds to a given Tweet ID. \n",
    "\n",
    "Since Tweets can be deleted by users, it's possible that certain IDs will only retrieve a null value. For this reason, it's safest to use a `try`/`except` block when retrieving values from the fetched Tweet. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counted 10 Tweet IDs in <_io.StringIO object at 0x107234558>.\n",
      "id: 588665495508766721\n",
      "RT @30SecFlghts: Yep it was bad from the jump https://t.co/6vsFIulyRB\n",
      "\n",
      "id: 588665495487811584\n",
      "@8_s2_5 おかえりなさいまし\n",
      "\n",
      "id: 588665495492124672\n",
      "O link http://t.co/u8yh4xdIAF por @YouTube é o tweet mais popular hoje na minha feed.\n",
      "\n",
      "id: 588665495487844352\n",
      "RT @dam_anison: 【アニサマ2014 LIVEカラオケ⑤】\n",
      "μ'sのライブ映像がDAMに初登場！それは「それは僕たちの奇跡」！\n",
      "μ's結成から5年間の\"キセキ\"を噛み締めながら歌いたい！\n",
      "→http://t.co/ZCAB7jgE4L #anisama http:…\n",
      "\n",
      "id: 588665495513006080\n",
      "[Tweet not available]\n",
      "\n",
      "id: 588665495525588992\n",
      "坂道の時に限って裏の車がめっちゃ車間距離近づけて停めてくるから死ぬかと思った\n",
      "\n",
      "id: 588665495512948737\n",
      "Christina Grimmie #RisingStar\n",
      "17\n",
      "\n",
      "id: 588665495487909888\n",
      "Dolgun Dudaklı Kadınların Çok İyi Bildiği 14 Şey http://t.co/vvEzTlqWOv http://t.co/dsWke4uXQ3\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from io import StringIO\n",
    "ids_f =\\\n",
    "    StringIO(\"\"\"\\\n",
    "    588665495492124672\n",
    "    588665495487909888\n",
    "    588665495508766721\n",
    "    588665495513006080\n",
    "    588665495517200384\n",
    "    588665495487811584\n",
    "    588665495525588992\n",
    "    588665495487844352\n",
    "    88665495492014081\n",
    "    588665495512948737\"\"\")\n",
    "    \n",
    "oauth = credsfromfile()\n",
    "client = Query(**oauth)\n",
    "hydrated = client.expand_tweetids(ids_f)\n",
    "\n",
    "    \n",
    "for tweet in hydrated:            \n",
    "        id_str = tweet['id_str']\n",
    "        print('id: {}'.format(id_str))\n",
    "        text = tweet['text']\n",
    "        if text.startswith('@null'):\n",
    "            text = \"[Tweet not available]\"\n",
    "        print(text + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although we provided the list of IDs as a string in the above example, the standard use case is to pass a file-like object as the argument to `expand_tweetids()`. "
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}